Dataset


PROJECT: ADMIRRAL
DATASET DESCRIPTION: Molecular dynamics (MD) simulation data for a three-component system [DPPC-DOPC-CHOL]
3k disordered 3-component-system [DPPC-DOPC-CHOL]
3K DDC Mol. Mod.
Short Description:

Molecular dynamics (MD) simulation data for a three-component system [DPPC-DOPC-CHOL]

Long Description:

This dataset contains one file with molecular dynamics (MD) simulation data for a three-component system [DPPC-DOPC-CHOL]. It is the default dataset (3k disordered) used for the P2B1 benchmark. The simulation is a 10 microsecond (μs), coarse-grained bead simulation in Protein Data Bank format containing 3,000 lipids and 3,000 frames. Extracting the 2.87 GB file (3k_run10_10us.35fs-DPPC.10-DOPC.70-CHOL.20.dir.tar.gz) produces a directory called 3k_run10_10us.35fs-DPPC.10-DOPC.70-CHOL.20.dir that contains the MD simulation data split into 29 .npz chunks. The first 28 files are 107 MB each; the last is 66 MB. The chunk shape is (100, 3040, 12, 20); the last three dimensions correspond to the number of molecules, the number of beads, and the number of features. The data format is [Frames (2900), Molecules (3040), Beads (12), [rel_x, rel_y, rel_z, CHOL, DPPC, DIPC, Head, Tail, BL1, BL2, BL3, BL4, BL5, BL6, BL7, BL8, BL9, BL10, BL11, BL12] (20)]. The data come from the BAASiC ADMIRRAL (formerly known as Pilot 2) dataset, produced at LLNL (LLNL-MI-724660) under contract DE-AC52-07NA27344 and dated 11/28/17. For more information, refer to the GitHub Repository (https://github.com/CBIIT/NCI-DOE-Collab-Pilot2-Autoencoder_MD_Simulation_Data).
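The feature layout described above can be made concrete with a small index map. This is a minimal sketch, not part of the dataset's own tooling: the names `FEATURE_INDEX`, `total_frames`, and `label_bead` are hypothetical, and the frame count assumes every chunk holds 100 frames, as the stated chunk shape implies. The key names inside the .npz files are not given in the description, so loading is left out.

```python
# Feature names for the last axis, copied from the dataset description.
FEATURE_NAMES = (
    ["rel_x", "rel_y", "rel_z", "CHOL", "DPPC", "DIPC", "Head", "Tail"]
    + [f"BL{i}" for i in range(1, 13)]
)

# Hypothetical helper: position of each named feature along the last axis.
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURE_NAMES)}

# Values stated in the description: (frames, molecules, beads, features).
CHUNK_SHAPE = (100, 3040, 12, 20)
N_CHUNKS = 29


def total_frames(n_chunks=N_CHUNKS, frames_per_chunk=CHUNK_SHAPE[0]):
    """Frame count, assuming every chunk holds the same number of frames."""
    return n_chunks * frames_per_chunk


def label_bead(bead_vector):
    """Map one 20-element bead feature vector to a {name: value} dict."""
    if len(bead_vector) != len(FEATURE_NAMES):
        raise ValueError("expected %d features" % len(FEATURE_NAMES))
    return dict(zip(FEATURE_NAMES, bead_vector))
```

With an array library, the same index map would let a reader pick out, for example, the three relative coordinates as positions 0 to 2 of the last axis.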

VERSION: Version 1
CONTENT TYPE: Molecular Dynamics Simulation Data, Lipid Molecules
CDRP Models & Software
DATASET DESCRIPTION: Collection of metadata and DataFrames used by machine learning models in the Cellular-Level Pilot project to predict drug response in various cancer cell lines
Cancer Drug Response Prediction Dataset
CDRP
Short Description:

Collection of metadata and DataFrames used by machine learning models in the Cellular-Level Pilot project to predict drug response in various cancer cell lines

Long Description:

This dataset contains:

  • DataFrames and supporting metadata used by Combo, Single Drug Response Predictor (formerly P1B3), Uno, UNOMT, CLRNA, and benchmarking machine learning models in the Cellular-Level Pilot project to predict drug response in various cancer cell lines.
  • Gene expression and drug response data for cancer cell lines from the NCI-60 Human Cancer Cell Line Screen (NCI 60), NCI ALMANAC, NCI Sarcoma (SCL), NCI Small Cell Lung Cancer (SCLC), Cancer Cell Line Encyclopedia (CCLE), Genomics of Drug Sensitivity in Cancer (GDSC), Genentech Cell Line Screening Initiative (gCSI), and Cancer Therapeutics Response Portal (CTRP) studies, and molecular descriptors generated using Dragon 7.0 and Mordred software packages.
  • Relevant metadata for the cancer cell lines and drug compounds.
  • A list of genes from the Library of Integrated Network-Based Cellular Signatures (LINCS) 1000 study. The LINCS1000 gene set was used as a reference to filter cancer cell line data.

The TopN DataFrames for the Cellular-Level Pilot combine drug response data, gene expression data, and drug molecular descriptors into a single DataFrame to support building binary classification or regression machine learning models that predict drug response. These DataFrames include the top N cancer types, that is, those with the most cell lines for which RNA-Seq and drug response data are available. The models can be further evaluated and improved by using learning curves, an empirical method. For more information, refer to the following links.

GitHub repository links:

CLRNA

https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Semi-Supervised-Feature-Learning-with-Center-Loss

Combo

https://github.com/CBIIT/NCI-DOE-Colab-Pilot1-Combo-combination-drug-response-predictor

Learning Curve

https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Learning-Curve

Single Drug Response Predictor

https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Single-Drug-Response-Predictor

Uno

https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Unified-Drug-Response-Predictor

 

Source links:

Aspuru-Guzik VAE

https://github.com/aspuru-guzik-group/chemical_vae

CCLE

https://portals.broadinstitute.org/ccle/data

CTRP

https://portals.broadinstitute.org/ctrp/

Dose Response AUC

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753377/

GDC

https://portal.gdc.cancer.gov/

GDSC

https://www.cancerrxgene.org/downloads/bulk_download

LINCS1000

http://lincsportal.ccs.miami.edu/dcic-portal/

NCI ALMANAC

https://dtp.cancer.gov/ncialmanac/initializePage.do

NCI PDMR

https://pdmdb.cancer.gov/web/apex/f?p=101:41

NCI Sarcoma

https://sarcoma.cancer.gov/sarcoma/downloads.xhtml

NCI Small Cell Lung Cancer

https://sclccelllines.cancer.gov/sclc/

NCI-60 - CellMiner

https://discover.nci.nih.gov/cellminer/loadDownload.do

NCI-60 - DTP

https://dtp.cancer.gov/databases_tools/bulk_data.htm

gCSI

https://pharmacodb.pmgenomics.ca/datasets/4

 

VERSION: Version 1
CONTENT TYPE: RNA-Seq, Drug Response, Drug Molecular Descriptors, SMILES
PROJECT: ATOM
DATASET DESCRIPTION: This training set contains 346,780 compounds and their corresponding HOMO-LUMO gaps calculated at the density functional theory (DFT) level.
Curated DFT HOMO-LUMO
HOMO-LUMO
Short Description:

This training set contains 346,780 compounds and their corresponding HOMO-LUMO gaps calculated at the density functional theory (DFT) level.

Long Description:

This training set contains 346,780 compounds and their corresponding HOMO-LUMO gaps calculated at the density functional theory (DFT) level. Chemicals with duplicate ChEMBL IDs, InChI keys, or standardized SMILES have been removed for ease of use with AMPL.
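The kind of duplicate removal described above can be sketched as follows. This is an illustration only, not the curation procedure used for this dataset: the field names `chembl_id`, `inchi_key`, and `smiles` are hypothetical, and the sketch keeps the first occurrence of each compound, which is an assumption.

```python
def drop_duplicates(records, keys=("chembl_id", "inchi_key", "smiles")):
    """Keep a record only if none of its identifiers has been seen before.

    `records` is a list of dicts; `keys` names the identifier fields
    (hypothetical names, chosen for this sketch).
    """
    seen = {key: set() for key in keys}
    kept = []
    for record in records:
        # A match on any single identifier marks the record as a duplicate.
        if any(record[key] in seen[key] for key in keys):
            continue
        for key in keys:
            seen[key].add(record[key])
        kept.append(record)
    return kept


compounds = [
    {"chembl_id": "C1", "inchi_key": "K1", "smiles": "CCO"},
    {"chembl_id": "C2", "inchi_key": "K1", "smiles": "OCC"},  # same InChI key
    {"chembl_id": "C3", "inchi_key": "K3", "smiles": "CCN"},
]
unique = drop_duplicates(compounds)
```

Here the second record is dropped because it shares an InChI key with the first, even though its ID and SMILES string differ.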

VERSION: Version 1
CONTENT TYPE: HOMO-LUMO gaps
DATASET DESCRIPTION: Collection of drug MoA information on FDA-approved and anti-cancer drugs
Drug MoA Information
Drug MoA
Short Description:

Collection of drug MoA information on FDA-approved and anti-cancer drugs

Long Description:

This dataset contains drug MoA information on both FDA-approved anti-cancer drugs and investigational drugs/compounds.

  • One text file provides the MoA information of compounds collected from the Drug Repurposing Hub of the Broad Institute. The data have been further processed to include compound name, PubChem ID, Broad Institute ID, SMILES, MoA description, and target gene symbols.
  • The other text file provides the MoA information of compounds/drugs included in the CTRP, GDSC, CCLE, and gCSI drug screening studies. The MoA information is curated from multiple sources and is grouped into categories. Target genes are represented by both gene symbols and Entrez IDs. Drug IDs used by the Cellular-Level Pilot project are also included.
VERSION: Version 1
CONTENT TYPE: Drug Molecular Descriptors, SMILES, Cell Line Drugs
DATASET DESCRIPTION: Collection of drug molecular descriptor data
Drug Molecular Descriptors
Drug Mol. Descrip.
Short Description:

Collection of drug molecular descriptor data

Long Description:

This dataset contains drug molecular descriptors generated using the Dragon 7.0 and Mordred software packages.

  • One file provides the molecular descriptors for the drugs generated using the Dragon 7.0 software package, which calculates 5,270 molecular descriptors. They include the simplest atom types, functional groups and fragment counts, topological and geometrical descriptors, and three-dimensional descriptors, as well as several property estimates (such as logP) and drug-like and lead-like alerts (such as Lipinski's alert). The Dragon 7.0 software package also generates path-based fingerprints (PFP) and extended connectivity fingerprints (ECFP) for drugs.
  • The other file provides the molecular descriptors for the drugs generated using the Mordred software package, which calculates 1,826 molecular descriptors.

For more information, refer to the GitHub Repository (https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Learning-Curve).

VERSION: Version 1
CONTENT TYPE: Drug Molecular Descriptors
PROJECT: ATOM
DATASET DESCRIPTION: Histamine-1 (H1), Muscarinic Receptors 2 (M2), and hERG binding affinity along with ligand structural data
H1_M2_update_10nM
H1_M2
Short Description:

Histamine-1 (H1), Muscarinic Receptors 2 (M2), and hERG binding affinity along with ligand structural data

Long Description:

This dataset contains one file with Histamine-1 (H1), Muscarinic Receptors 2 (M2), and hERG binding affinity along with ligand structural data.

VERSION: Version 1
CONTENT TYPE: Protein Assay, Ligand Structural Data
PROJECT: IMPROVE
DATASET DESCRIPTION: The IMPROVE Benchmark Dataset comprises four kinds of data: cell line response data, cell line multi-omics data, drug feature data, and data partitions.
IMPROVE Benchmark Dataset
IMPROVE
Short Description:

The IMPROVE Benchmark Dataset comprises four kinds of data: cell line response data, cell line multi-omics data, drug feature data, and data partitions.

Long Description:

The IMPROVE Benchmark Dataset comprises four kinds of data: 1) cell line response data, 2) cell line multi-omics data, 3) drug feature data, and 4) data partitions.

1. Cell line response data were extracted from five sources:

  • Cancer Cell Line Encyclopedia (CCLE)
  • Cancer Therapeutics Response Portal version 2 (CTRPv2)
  • Genomics of Drug Sensitivity in Cancer version 1 (GDSC1)
  • Genomics of Drug Sensitivity in Cancer version 2 (GDSC2)
  • Genentech Cell Line Screening Initiative (gCSI)

A unified dose-response fitting pipeline was applied to the multi-dose viability data to calculate dose-independent response metrics such as the area under the dose-response curve (AUC) and the half-maximal inhibitory concentration (IC50).
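To illustrate how dose-independent metrics can be derived from a fitted curve, here is a minimal sketch. The description does not state which curve model the pipeline fits, so the three-parameter Hill model, the parameter names (`einf`, `ec50`, `hs`), and the normalization of AUC by the log-dose range are all assumptions made for illustration.

```python
import math


def hill_viability(dose, einf, ec50, hs):
    """Viability under an assumed 3-parameter Hill model (1.0 at zero dose)."""
    return einf + (1.0 - einf) / (1.0 + (dose / ec50) ** hs)


def normalized_auc(doses, einf, ec50, hs):
    """Trapezoidal area under the curve over log10(dose), divided by the
    log-dose range so that a drug with no effect scores 1.0."""
    ordered = sorted(doses)
    xs = [math.log10(d) for d in ordered]
    ys = [hill_viability(d, einf, ec50, hs) for d in ordered]
    if len(xs) < 2 or xs[-1] == xs[0]:
        raise ValueError("need at least two distinct doses")
    area = sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def ic50(einf, ec50, hs):
    """Dose at which modeled viability equals 0.5, or None if the curve
    never drops that low (einf >= 0.5)."""
    if einf >= 0.5:
        return None
    return ec50 * (0.5 / (0.5 - einf)) ** (1.0 / hs)
```

Both metrics depend only on the fitted parameters and the chosen dose range, not on any single measured dose, which is what makes them comparable across studies that tested different concentrations.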

2. The multi-omics data of cell lines were extracted from the Dependency Map (DepMap) portal of CCLE. The types of data included are gene expression, mutation, DNA methylation, copy number variation, protein expression, and miRNA expression. Data preprocessing, such as discretizing copy number variation and mapping between different gene identifier systems, was performed.  

3. Drug information was retrieved from PubChem. Based on the drug SMILES (Simplified Molecular Input Line Entry Specification) strings, we calculated their molecular fingerprints and descriptors using the Mordred and RDKit Python packages.  

4. Data partition files were generated using the IMPROVE benchmark data preparation pipeline. They indicate, for each modeling analysis run, which samples should be included in the training, validation, and testing sets, for building and evaluating the drug response prediction (DRP) models.  
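As a sketch of how partition files like those in item 4 might be consumed: the actual file format is not described here, so the code below assumes, purely for illustration, that each partition file lists one integer row index per line. The function names are hypothetical.

```python
def parse_index_file(text):
    """Parse the contents of a partition file, assuming one integer row
    index per line (a hypothetical format; blank lines are skipped)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def apply_partition(samples, train_idx, val_idx, test_idx):
    """Split `samples` into train/val/test lists by row index, refusing
    partitions that assign the same row to more than one set."""
    splits = {"train": train_idx, "val": val_idx, "test": test_idx}
    seen = set()
    for name, indices in splits.items():
        overlap = seen.intersection(indices)
        if overlap:
            raise ValueError(f"rows {sorted(overlap)} reused in {name} set")
        seen.update(indices)
    return {
        name: [samples[i] for i in indices]
        for name, indices in splits.items()
    }


samples = ["s0", "s1", "s2", "s3", "s4"]
parts = apply_partition(
    samples,
    train_idx=parse_index_file("0\n1\n2\n"),
    val_idx=parse_index_file("3\n"),
    test_idx=parse_index_file("4\n"),
)
```

Checking for overlap before splitting guards against leakage between the training and evaluation sets, which would inflate reported model performance.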

For more details, refer to Benchmark Data for Cross-Study Analysis.

VERSION: Version 0
CONTENT TYPE: Cell Line Response, Cell Line Multi-omics, Drug Feature, Data Partitions
DATASET DESCRIPTION: Combined DataFrame that includes drug response data, gene expression data, and drug molecular descriptors for the top N cancer types
Integrated DataFrames of Most Prevalent Cancer Types - TopN [Top6/Top21]
TopN Cancer Types
Short Description:

Combined DataFrame that includes drug response data, gene expression data, and drug molecular descriptors for the top N cancer types

Long Description:

This asset contains five files. The TopN DataFrames for the Cellular-Level Pilot combine drug response data, gene expression data, and drug molecular descriptors into a single DataFrame to support building binary classification or regression machine learning models that predict drug response. These DataFrames include the top N cancer types, that is, those with the most cell lines for which RNA-Seq and drug response data are available. For more information, refer to the following source links:

Source CCLE

https://portals.broadinstitute.org/ccle/data

Source CTRP

https://portals.broadinstitute.org/ctrp/

Source GDSC

https://www.cancerrxgene.org/downloads/bulk_download

Source NCI-60 – DTP

https://dtp.cancer.gov/databases_tools/bulk_data.htm

Source gCSI

https://pharmacodb.pmgenomics.ca/datasets/4
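The TopN selection described in this card can be sketched as a simple count. This is an illustration, not the project's own code: the field names `cancer_type`, `has_rnaseq`, and `has_response` are hypothetical, and the sketch assumes a cell line counts only when both data types are available.

```python
from collections import Counter


def top_n_cancer_types(cell_lines, n):
    """Return the n cancer types with the most cell lines that have both
    RNA-Seq and drug response data (field names are hypothetical)."""
    counts = Counter(
        line["cancer_type"]
        for line in cell_lines
        if line["has_rnaseq"] and line["has_response"]
    )
    return [cancer_type for cancer_type, _ in counts.most_common(n)]


cell_lines = [
    {"cancer_type": "lung", "has_rnaseq": True, "has_response": True},
    {"cancer_type": "lung", "has_rnaseq": True, "has_response": True},
    {"cancer_type": "lung", "has_rnaseq": True, "has_response": True},
    {"cancer_type": "breast", "has_rnaseq": True, "has_response": True},
    {"cancer_type": "breast", "has_rnaseq": True, "has_response": True},
    {"cancer_type": "skin", "has_rnaseq": True, "has_response": False},
    {"cancer_type": "colon", "has_rnaseq": True, "has_response": True},
]
top2 = top_n_cancer_types(cell_lines, 2)
```

In this toy input the skin cell line is excluded because it lacks drug response data, so the two most represented types are lung and breast.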

VERSION: Version 1
CONTENT TYPE: RNA-Seq, Drug Response, Drug Molecular Descriptors
PROJECT: ADMIRRAL
DATASET DESCRIPTION: Molecular dynamics simulation data of membrane interactions of the globular domain and the hypervariable region of KRAS4b
KRAS4b Simulation Data
KRAS4b Sim.
Short Description:

Molecular dynamics simulation data of membrane interactions of the globular domain and the hypervariable region of KRAS4b

Long Description:

This dataset contains molecular dynamics simulation data for the paper "Membrane interactions of the globular domain and the hypervariable region of KRAS4b define its unique diffusion behavior" (https://elifesciences.org/articles/47654). For each individual simulation, the MoDaC asset provides a topology file (.gro), a PDB file (.pdb), and a sample trajectory of 1 μs.

VERSION: Version 1
CONTENT TYPE: Molecular Dynamics Simulation Data, KRAS, Protein Modeling
PROJECT: MOSSAIC
DATASET DESCRIPTION: Collection of pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons Platform at the National Cancer Institute.
ML Ready Pathology Reports
ML Path. Rep.
Short Description:

Collection of pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons Platform at the National Cancer Institute.

Long Description:

This dataset contains 7,187 pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons Platform at the National Cancer Institute.

  • The files in ml_ready_raw_text_pathology_reports.tar.gz were converted from PDF to text using an optical character recognition program (refer to the Tesseract link). An example of a report is available on the GDC archive portal (refer to the GDC link).
  • The file ml_ready_raw_text_histo_metadata.csv contains annotations (such as site and histology) extracted from those reports.

This dataset is used as input to MT-CNN and HiSan (refer to the GitHub Repository links and Model links).

GDC

https://portal.gdc.cancer.gov/legacy-archive/files/a9a42650-4613-448d-895e-4f904285f508

GitHub Repository HiSan

https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Pathology-Reports-Hierarchical-Self-Attention-Network

GitHub Repository MT-CNN

https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Multitask-Convolutional_Neural_Network

Model HiSan

https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7565752

Model MT-CNN

https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7330732

Tesseract

https://github.com/tesseract-ocr/

VERSION: Version 1
CONTENT TYPE: Pathology Reports, Genomic Data Commons